Update env, cmake, GEOS_Util and MAPL releases in components.yaml & update README.md after decommissioning of SLES12 at NCCS by mathomp4 · Pull Request #796 · GEOS-ESM/GEOSldas

mathomp4 · 2025-02-26T19:29:15Z

This non-0-diff PR updates the components.yaml to approximately match that of GEOSgcm main as of 2025-Mar-19.

env:       v4.29.1 --> v4.36.0
cmake:     v3.52.0 --> v3.57.0
GEOS_Util: v2.1.3  --> v2.1.6
MAPL:      v2.50.1 --> v2.54.2

The non-0-diff changes are within "roundoff," and are caused by the newer compiler/baselibs version. Intel tests with standard optimization are 0-diff when bit shaving is not used.

Note that the ESMA_env, ESMA_cmake, and MAPL versions are slightly newer than those of GEOSgcm main, but per the respective release notes this should be 0-diff w.r.t. what is in GEOSgcm main (but not 0-diff w.r.t. what is on GEOSldas develop before this PR!).

The PR also updates README.md to reflect that SLES15 is now the only O/S on the NCCS Discover platform.

Earlier versions of this PR also included the helfsurface() optimization of GEOS-ESM/GMAO_Shared#348, which requires a newer Intel compiler but should be zero-diff (which is why it will be done in a separate PR).

biljanaorescanin · 2025-03-11T12:02:32Z

Testing summary:
Almost all test comparisons fail, confirming that the PR is not zero-diff. All differences are roundoff.

The baselibs change from GCC 12 to GCC 14 is the reason for all GNU failures.
The Intel failures come from the new compilers.

| Runtype | Clone | Build | Build Time | Model Run/Compare | Assim Run/Compare |
|---|---|---|---|---|---|
| conus | pass | pass | 13 min | pass/FAIL | -- / -- |
| global | -- | -- | -- | pass/FAIL | pass/pass |
| globalcs | -- | -- | -- | pass/FAIL | pass/pass |
| globalcnclm4 | -- | -- | -- | pass/FAIL | -- / -- |
| debugconus | -- | pass | 11 min | pass/pass | -- / -- |
| aggconus | -- | pass | 14 min | pass/FAIL | -- / -- |
| aggglobal | -- | -- | -- | pass/FAIL | pass/FAIL |
| aggglobalcs | -- | -- | -- | pass/FAIL | pass/FAIL |
| aggglobalcnclm4 | -- | -- | -- | pass/FAIL | -- / -- |
| gnuconus | pass | pass | 31 min | pass/FAIL | -- / -- |
| gnuglobal | -- | -- | -- | pass/FAIL | pass/FAIL |
| gnuglobalcs | -- | -- | -- | pass/FAIL | pass/FAIL |
| gnuglobalcnclm4 | -- | -- | -- | pass/FAIL | -- / -- |
| gnudebugconus | -- | pass | 20 min | pass/pass | -- / -- |

Note: Helfand is not used as an option during testing (we use Louis as the default), so for this PR the change to use the branch is trivially zero-diff.

mathomp4 · 2025-03-11T13:26:57Z

Hmm. You are getting failures from the new Intel? Are these build-time or run-time? That is, did things crash or just get different answers?

gmao-rreichle · 2025-03-11T13:28:30Z

@biljanaorescanin, @mathomp4,

I'm a bit confused by this PR and think we need to separate the update of the environment and the Helfand update. Specifically:

It is not correct that only Louis is used. The GLOBALCS/assim tests use Helfand:

LDAS_AGGGLOBALCS/assim/CURRENT/run/LDAS.rc:CHOOSEMOSFC:                        1
LDAS_GLOBALCS/assim/CURRENT/run/LDAS.rc:CHOOSEMOSFC:                        1
LDAS_GNUGLOBALCS/assim/CURRENT/run/LDAS.rc:CHOOSEMOSFC:                        1

I'm surprised that the comparison passes for the GLOBAL/assim test but fails for the GLOBAL/model test (and similarly for other tests).

I think it would be best to remove the Helfand branch from the PR and examine the impact of the environment update in isolation.
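A quick way to check which surface-layer scheme a given test uses is to scan its LDAS.rc for `CHOOSEMOSFC`. The sketch below is illustrative only: the helper name `surface_layer_scheme` is hypothetical, the mapping of `1` to Helfand follows the listing above, and the mapping of `0` to Louis is an assumption.

```python
import re

# Assumption: 1 selects Helfand (per the listing above); 0 is assumed to select Louis.
SCHEMES = {0: "Louis", 1: "Helfand"}


def surface_layer_scheme(rc_text):
    """Return the scheme named by the first CHOOSEMOSFC entry in rc_text, or None."""
    for line in rc_text.splitlines():
        match = re.match(r"\s*CHOOSEMOSFC:\s*(\d+)", line)
        if match:
            value = int(match.group(1))
            return SCHEMES.get(value, f"unknown ({value})")
    return None  # no CHOOSEMOSFC entry found


# Example with the entry format shown in the listing above.
sample = "CHOOSEMOSFC:                        1\n"
print(surface_layer_scheme(sample))
```

Applied to the `LDAS.rc` of each test case, this would flag the GLOBALCS tests as the ones exercising Helfand.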

biljanaorescanin · 2025-03-11T13:38:59Z

I forgot CS uses Helfand.
I'm now running without the helfand branch. I did run that a few weeks back, but I removed it and didn't leave a comment on the PR about what the result was.

biljanaorescanin · 2025-03-11T17:01:05Z

Running without the helfand branch is zero-diff to running with the helfand branch for the Intel tests, confirming again that the helfand vectorization is zero-diff.
You will see below that the GNU tests fail, and I think that is because the tests need baselibs built with GCC 14 to pass.
It may also be a Discover glitch. I'm not 100% sure, since the build failed.
For my previous test summary, @mathomp4 changed his sandbox to GCC 14.

gmao-rreichle · 2025-03-11T18:07:41Z

Thanks, @biljanaorescanin. Here are my 2c:

> Running without the helfand branch is zero-diff to running with the helfand branch for the Intel tests, confirming again that the helfand vectorization is zero-diff.

This is great, but I still think we want this to be in a separate PR for clarity. When releases are made, the release doc is basically a collection of PR titles. Having a separate PR for the zero-diff helfsurface() optimization change makes it much easier to understand what was done when a few months have passed and nobody can remember off the top of their head. I edited the present PR accordingly. Once the present PR has been merged, we can test and merge the GMAO_Shared helfsurface() optimization PR GEOS-ESM/GMAO_Shared#348

> You will see below that the GNU tests fail, and I think that is because the tests need baselibs built with GCC 14 to pass. It may also be a Discover glitch. I'm not 100% sure, since the build failed.
> For my previous test summary, @mathomp4 changed his sandbox to GCC 14.

What does it take to include the GCC-14 change into this PR? It doesn't make sense to me to merge this PR when it doesn't work with the current GNU version. Maybe I'm missing something.

Also, I'm still surprised that the comparison passes for the GLOBAL/assim test but fails for the GLOBAL/model test (and similarly for other tests). This could be a difference in 1d (tile) vs. 2d output and MAPL HISTORY regridding. Before we can merge the PR, we need to understand better what exactly is not zero-diff here.

biljanaorescanin · 2025-03-11T19:20:22Z

If I focus only on Intel, you will see that only the NC4 files fail, and the differences are roundoff:
conus/cmp_model.conus.log:Exception: Comparing outputs failed! BIN: True, NC4: False, RST: True
global/cmp_model.global.log:Exception: Comparing outputs failed! BIN: True, NC4: False, RST: True
globalcnclm4/cmp_model.globalcnclm4.log:Exception: Comparing outputs failed! BIN: True, NC4: False, RST: True
globalcs/cmp_model.globalcs.log: Comparing outputs failed! BIN: True, NC4: False, RST: True
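To pull the failing output type out of log lines like these, something along the following lines could work. This is a minimal sketch, not part of the regression scripts: the function name `failed_kinds` is hypothetical, and it assumes that `True` means the comparison for that output type matched, which is how the lines above read.

```python
import re

# Matches the message format seen in the cmp_model logs quoted above.
PATTERN = re.compile(
    r"Comparing outputs failed! BIN: (True|False), NC4: (True|False), RST: (True|False)"
)


def failed_kinds(log_line):
    """Return the output types flagged False in a comparison log line.

    Assumption: True means the comparison for that output type matched.
    Returns None when the line does not contain the comparison message.
    """
    match = PATTERN.search(log_line)
    if match is None:
        return None
    flags = dict(zip(("BIN", "NC4", "RST"), match.groups()))
    return [kind for kind, ok in flags.items() if ok == "False"]


line = ("conus/cmp_model.conus.log:Exception: Comparing outputs failed! "
        "BIN: True, NC4: False, RST: True")
print(failed_kinds(line))
```

Because the pattern is searched rather than anchored, it also handles the `globalcs` line, which lacks the `Exception:` prefix.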

mathomp4 · 2025-03-11T23:12:17Z

> Thanks, @biljanaorescanin. Here are my 2c:
>
> > You will see below that the GNU tests fail, and I think that is because the tests need baselibs built with GCC 14 to pass. It may also be a Discover glitch. I'm not 100% sure, since the build failed.
> > For my previous test summary, @mathomp4 changed his sandbox to GCC 14.
>
> What does it take to include the GCC-14 change into this PR? It doesn't make sense to me to merge this PR when it doesn't work with the current GNU version. Maybe I'm missing something.

This is actually an issue with the scripting. In the regression scripts, for GNU runs I have to replace the g5_modules with one for GNU. At the moment, the LDAS scripting still uses a GCC 12 g5_modules (since moving to GCC 14 would be non-zero-diff).

I can change that in the scripting, and then the GNU tests would go NZD the next time things run.

gmao-rreichle

See inline comments below.

README.md

gmao-rreichle · 2025-03-19T22:07:36Z

components.yaml

  local: ./@env
  remote: ../ESMA_env.git
-  tag: v4.29.1
+  tag: v4.36.0


@biljanaorescanin, @mathomp4 : I ticked up the versions of env, cmake, and MAPL (c1007c8). Based on the documentation of the respective releases, this should be zero-diff w.r.t. what was on the PR before my latest edits (but definitely non-0-diff w.r.t. current develop). @mathomp4, please let me know if you have any objections or suggestions. @biljanaorescanin, when you get a chance, please re-test the PR. If all is as expected, the new test is 0-diff w.r.t. the most recent test (if you still have a copy).

biljanaorescanin · 2025-03-20T16:04:39Z

Tests are zero diff to previous iteration of testing.
We don't have GNU test results because I didn't use the right baselibs; that needs to be changed on Matt's side. We do have results from before, if anyone wants to take a look; the differences were roundoff.

gmao-rreichle · 2025-03-20T21:54:13Z

@mathomp4, @biljanaorescanin, @weiyuan-jiang:

I am still trying to understand the very unusual non-0-diff character of this PR. Specifically, the LDAS_GLOBAL/model test fails the comparison for the nc4 files (in just a small subset of variables, and within what seems to be roundoff). The curious thing is that the LDAS_GLOBAL/assim test passes!

If there was any change in the science code (or a roundoff change in the science calcs triggered by the newer env/baselibs), then the assim test should also fail the nc4 comparison. The fact that the assim test passes suggests that it's something in MAPL and/or the LDAS tile_bin2nc4 utility. That is, the variables would need to be 0-diff when they are in memory during the simulation, but then something changes when the data are written out.

I noticed that the Intel tests with standard optimization that do pass have no bit shaving in HISTORY.rc, whereas the tests that fail the nc4 comparison have bit shaving enabled. I went through the documentation of the MAPL releases between 2.50.1 and 2.54.2 and didn't notice anything that might have impacted the bit shaving, and the documentation suggests that for the most part the MAPL releases in question should all be 0-diff among themselves (for the GCM, which is usually a bigger hurdle than LDAS when it comes to 0-diff). So I can't see how exactly the bit shaving might cause the non-0-diffs seen here, but I also can't quite rule it out.

Thoughts?

weiyuan-jiang · 2025-03-21T13:41:51Z

The failed comparison in the model run is on files that are not in the assim run. So we probably don't need to worry about this. What happened to GCM's history output with bit shaving? @mathomp4

gmao-rreichle · 2025-03-21T14:26:23Z

> The failed comparison in the model run is on files that are not in the assim run. So we probably don't need to worry about this.

@weiyuan-jiang, I'm not sure I understand the reasoning. Of course the "model" test case has different HISTORY output. What I'm after is understanding which exact changes in the PR caused the non-0-diff result for the model test case. Normally, anything that causes non-0-diff in the model test case would also cause non-0-diff in the assim test case. The fact that the model case is non-0-diff but the assim case is 0-diff is very unusual, and I'd really like to be able to explain this so we can make more informed decisions about how to interpret the non-0-diff changes in science applications going forward

gmao-rreichle · 2025-03-24T14:59:43Z

For lack of a better idea I just tested a variant of the PR's branch that reverts MAPL back to 2.50.1. I only ran the Intel tests w/ standard optimization. The result is 0-diff w.r.t. using MAPL 2.54.2, so MAPL is not the cause of the non-0-diff result vs. develop. Still a mystery to me why we get non-0-diff for output from the "model" test but 0-diff for the "assim" test.

biljanaorescanin · 2025-03-31T18:48:19Z

In our regression testing, this test failed: global -- -- -- pass/FAIL pass/pass

If I run just the GLOBAL/model test and comment out `*.nbits: 12` in HISTORY.rc, so that bit shaving is not used, then the develop vs. branch comparison is zero-diff for both collection comparisons.
In our GLOBAL/assim run we don't use bit shaving in HISTORY.rc, which is why that comparison passed.
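For readers unfamiliar with bit shaving, here is a minimal sketch of the general idea: low-order mantissa bits of a 32-bit float are zeroed before output so that the data compress better. This is not MAPL's implementation. The function name `shave_float32` is hypothetical, and reading `nbits` as the number of retained mantissa bits is an assumption made for illustration.

```python
import struct


def shave_float32(value, nbits_kept):
    """Zero the low-order mantissa bits of a float32 by simple truncation.

    Assumption: nbits_kept is the number of explicit mantissa bits retained
    (an IEEE 754 float32 has 23 explicit mantissa bits).
    """
    drop = 23 - nbits_kept
    if not 0 <= drop <= 23:
        raise ValueError("nbits_kept must be between 0 and 23")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Clear the lowest `drop` bits; sign and exponent are left untouched.
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    (shaved,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return shaved


# 1 + 2**-20 needs more than 12 mantissa bits, so the small term is dropped.
print(shave_float32(1.0 + 2.0 ** -20, 12))
# 1.5 needs only one mantissa bit, so it is unchanged.
print(shave_float32(1.5, 12))
```

In this simplified form, shaving is a deterministic function of the in-memory value. The sketch therefore only illustrates that shaving is a separate, lossy step applied when data are written; it does not by itself explain the differences discussed above.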

mathomp4 added 2 commits February 26, 2025 14:27

WIP: Testing helfsurface update in GEOS_Util

4c64bb2

wrong repo

07c1daf

mathomp4 added the Not 0-diff label Feb 26, 2025

mathomp4 self-assigned this Feb 26, 2025

mathomp4 mentioned this pull request Feb 26, 2025

vectorize helfsurface() and related subroutines GEOS-ESM/GMAO_Shared#348

Merged

remove SLES12 from readme

6d45738

gmao-rreichle added 2 commits March 11, 2025 14:00

removing helfsurface optimization branch from components.yaml

c80a008

fixing typo in previous commit (components.yaml)

b928b6b

gmao-rreichle changed the title ~~WIP: Testing helfsurface update in GEOS_Util~~ WIP: Update components.yaml to match GEOSgcm main as of 2025-Feb-26 Mar 11, 2025

weiyuan-jiang added 2 commits March 13, 2025 09:42

Merge branch 'feature/wjiang/rm_sles12' into feature/wjiang/cleanup_h…

affb351

…elsurface

change README

60bf8a9

weiyuan-jiang mentioned this pull request Mar 13, 2025

remove SLES12 from readme #797

Closed

gmao-rreichle changed the title ~~WIP: Update components.yaml to match GEOSgcm main as of 2025-Feb-26~~ Update env, cmake, GEOS_Util and MAPL releases in components.yaml & update README.md after decommissioning of SLES12 at NCCS Mar 19, 2025

gmao-rreichle added 2 commits March 19, 2025 17:27

Update components.yaml

c1007c8

tick up minor release, should be 0-diff per respective release notes

Updated README.md

1dba0d7

gmao-rreichle reviewed Mar 19, 2025

Additional correction of SLES15-related text in README.md

8ab19d8

edited location of "Release" build in README.md

e6c3eac

gmao-rreichle added the documentation Improvements or additions to documentation label Apr 2, 2025

gmao-rreichle marked this pull request as ready for review April 2, 2025 16:28

gmao-rreichle requested review from a team as code owners April 2, 2025 16:28

gmao-rreichle approved these changes Apr 2, 2025

gmao-rreichle merged commit 6455631 into develop Apr 2, 2025
11 checks passed

gmao-rreichle deleted the feature/wjiang/cleanup_helsurface branch April 2, 2025 16:29
